Hierarchical Character-Word Models for Language Identification

Authors

  • Aaron Jaech
  • George Mulcaire
  • Shobhit Hathi
  • Mari Ostendorf
  • Noah A. Smith
Abstract

Social media messages’ brevity and unconventional spelling pose a challenge to language identification. We introduce a hierarchical model that learns character and contextualized word-level representations for language identification. Our method performs well against strong baselines, and can also reveal code-switching.

Similar resources

Investigation on language modelling approaches for open vocabulary speech recognition

By definition, words that are not present in a recognition vocabulary are called out-of-vocabulary (OOV) words. Recognition of unseen or new words is an important feature that is always desired in any real-world large vocabulary continuous speech recognition (LVCSR) system. However, human languages are complex in nature due to wide varieties of morphological richness such as inflections, deriva...

Language modeling of Chinese personal names based on character units for continuous Chinese speech recognition

In this paper, we analyze Chinese personal names to model their statistical phonotactic characteristics for continuous Chinese speech recognition. The analysis showed language-specific characteristics of Chinese personal names and strongly suggested the advantage of character-unit oriented modeling. A hierarchical language model was composed by reflecting statistical phonotactic characteristics ...

Language Identification of Bengali-English Code-Mixed data using Character&Phonetic based LSTM Models

Language identification of social media text still remains a challenging task due to properties like code-mixing and inconsistent phonetic transliterations. In this paper, we present a supervised learning approach for language identification at the word level of low resource Bengali-English code-mixed data taken from social media. We employ two methods of word encoding, namely character based an...

Topic Modeling of Chinese Language Using Character-Word Relations

Topic models are hierarchical Bayesian models for language modeling and document analysis. It has been well-used and achieved a lot of success in modeling English documents. However, unlike English and the majority of alphabetic languages, the basic structural unit of Chinese language is character instead of word, and Chinese words are written without spaces between them. Most previous research...

Automatic identification of language varieties: The case of Portuguese

Automatic Language Identification of written texts is a well-established area of research in Computational Linguistics. State-of-the-art algorithms often rely on n-gram character models to identify the correct language of texts, with good results seen for European languages. In this paper we propose the use of a character n-gram model and a word n-gram language model for the automatic classifica...

Publication date: 2016